Web-Document Retrieval by Genetic Learning of Importance Factors for HTML Tags

Authors

  • Sun Kim
  • Byoung-Tak Zhang
Abstract

In contrast to conventional documents, a Web document contains tags that provide hints about the document's structure. In this paper, we propose a Web-document retrieval method that exploits the characteristics of HTML tags. The method learns the importance of tags from a training text set, using a genetic algorithm to learn the importance weights. We also present a modified similarity measure that uses the tag information. Experiments were performed on a TREC document collection of 247,491 documents. Compared to the traditional IR method, the proposed method achieved a 15% improvement in average precision.
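The abstract does not give the weighting scheme or the genetic operators, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a hypothetical tag set (`TAGS`), a tag-weighted term frequency, a cosine-style similarity, and a simple genetic algorithm with truncation selection, one-point crossover, and Gaussian mutation, whose fitness is mean average precision on training queries. All names and parameter values are illustrative assumptions.

```python
import math
import random
from collections import defaultdict

# Hypothetical tag set; the paper's actual tag list is not given in the abstract.
TAGS = ["title", "h1", "b", "a", "body"]

def weighted_tf(doc, weights):
    """doc is a list of (term, tag) pairs; each occurrence counts its tag's weight."""
    tf = defaultdict(float)
    for term, tag in doc:
        tf[term] += weights[TAGS.index(tag)]
    return tf

def similarity(query, doc, weights):
    """Cosine-style similarity between a set of query terms and a tag-weighted document."""
    tf = weighted_tf(doc, weights)
    dot = sum(tf.get(t, 0.0) for t in query)
    norm = math.sqrt(sum(v * v for v in tf.values())) * math.sqrt(len(query))
    return dot / norm if norm else 0.0

def average_precision(ranking, relevant):
    """Mean of precision values at the ranks where relevant documents appear."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def fitness(weights, docs, queries):
    """Mean average precision over training queries given as (terms, relevant_ids)."""
    scores = []
    for terms, relevant in queries:
        ranking = sorted(docs, key=lambda d: similarity(terms, docs[d], weights),
                         reverse=True)
        scores.append(average_precision(ranking, relevant))
    return sum(scores) / len(scores)

def evolve(docs, queries, pop_size=20, generations=30, seed=0):
    """Assumed GA: keep the better half, refill with crossover plus mutation."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in TAGS] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda w: fitness(w, docs, queries), reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(TAGS))          # one-point crossover
            child = a[:cut] + b[cut:]
            i = rng.randrange(len(TAGS))               # mutate one weight
            child[i] = min(1.0, max(0.0, child[i] + rng.gauss(0, 0.1)))
            children.append(child)
        pop = parents + children
    return max(pop, key=lambda w: fitness(w, docs, queries))
```

Calling `evolve(docs, queries)` with `docs` as a mapping from document id to `(term, tag)` pairs returns one weight per entry in `TAGS`. Those weights can then be passed to `similarity` to rank unseen documents.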

Similar resources

Web Page Structure Enhanced Feature Selection for Classification of Web Pages

Web page classification is achieved using text classification techniques. It differs from traditional text classification because web page structure provides additional information about content importance. HTML tags provide visual web page representation and can be considered a parameter to highlight content importance. Textual keywords...

A New Study on Using HTML Structures to Improve Retrieval

Locating useful information effectively from the World Wide Web (WWW) is of wide interest. This paper presents new results on a methodology of using the structures and hyperlinks of HTML documents to improve the effectiveness of retrieving HTML documents. This methodology partitions the occurrences of terms in a document collection into classes according to the tags in which a particular term a...

Using the Structure of HTML Documents to Improve Retrieval

The World Wide Web (WWW) is a gigantic information resource, which is growing daily. As more and more data are added to the WWW, it is becoming increasingly difficult to effectively locate useful information from this environment. In this paper, we propose a method for making use of the structures and hyperlinks of HTML documents to improve the effectiveness of retrieving HTML documents. Our st...

Enhanced Information Retrieval by Using HTML Tags

Whenever digital libraries or knowledge management systems are to be automatically filled with web pages from the internet, document classification of the web pages is one of the major challenges. We present an approach which uses HTML tags in order to improve the quality of the hypertext document classification. Our approach uses weighting of HTML tags for separating relevant information in hy...

Significance of HTML Tags for Document Indexing and Retrieval

Indexing quality has an overwhelming effect on the retrieval effectiveness of search engines. In the past few years it has become one of the major challenges in the search engine area, particularly the task of automatically assigning high-quality terms to Web documents, which remains elusive. High indexing and retrieval quality requires work on term selection algorithms. This paper investigates the...

Publication date: 2000